# SRE for Society

- [[pull]] [[sre]] [[society]] [[community]]
- [[push]] [[agora]]

A proposal to map the principles of [[Site Reliability Engineering]] (SRE) to the design and maintenance of resilient human communities and social systems.

## The Premise

If we view a [[community]] as a distributed system, we can apply the rigorous engineering practices used to keep high-availability systems (like those at [[Google]]) running to keep our social groups healthy. The goal is not to treat people like machines, but to build *systems* that are resilient to human error and conflict.

## Mappings

### SLOs -> [[Social Contracts]]
In SRE, a **Service Level Objective (SLO)** defines the acceptable level of reliability (e.g., "99.9% of requests will succeed").
In a community, this maps to a **[[Social Contract]]**.
*   **Question:** What is the acceptable level of "friction" or misunderstanding we tolerate before declaring the community "broken"?
*   **Application:** Define expectations explicitly. We don't expect 100% harmony (which is impossible, and would be stifling anyway), but we define a threshold for acceptable discourse.
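
One way to make that threshold concrete, as a minimal sketch: treat each interaction as a "request" and measure the share that went by without friction. The function names, the idea of "flagged" interactions, and the 95% target are all assumptions made for illustration, not an established metric.

```python
def friction_free_ratio(total_interactions: int, flagged_interactions: int) -> float:
    """Share of interactions that were not flagged as friction.

    This plays the role of the measured indicator that an SLO is checked against.
    """
    if total_interactions <= 0:
        return 1.0  # no activity in the window, so nothing has gone wrong
    return (total_interactions - flagged_interactions) / total_interactions


def meets_social_slo(total_interactions: int,
                     flagged_interactions: int,
                     target: float = 0.95) -> bool:
    """True if the community is within its (hypothetical) social contract."""
    return friction_free_ratio(total_interactions, flagged_interactions) >= target


# Example window: 1000 interactions, 30 of them flagged as friction.
print(meets_social_slo(1000, 30))
```

The useful part is less the number itself than the fact that the community has to agree on the target in advance.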

### Error Budgets -> [[Forgiveness Budgets]]
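In SRE, an **Error Budget** is the amount of unreliability the SLO permits: with a 99.9% SLO, 0.1% of requests are allowed to fail. If you have budget left, you can take risks and push code. If you burn it all, the usual policy is to freeze risky changes until reliability recovers.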
In a community, this maps to a **[[Forgiveness Budget]]**.
*   **Question:** How much can a member "mess up" before they are banned?
*   **Application:** We need room to fail. If we demand perfection (0% error rate), we stifle innovation and authenticity. "Cancellation" happens when a community has zero error budget. We should track "social downtime" and use it to calibrate our tolerance.
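
The budget arithmetic can be sketched in a few lines. This is an illustration under stated assumptions: the class name `ForgivenessBudget`, the notion of counting "flagged" interactions as social downtime, and the two policy labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ForgivenessBudget:
    target: float              # agreed share of friction-free interactions, e.g. 0.95
    total_interactions: int    # interactions observed in the current window
    flagged_interactions: int  # interactions counted as "social downtime"

    @property
    def allowed(self) -> float:
        """Flagged interactions the target permits: (1 - target) * total."""
        return (1 - self.target) * self.total_interactions

    @property
    def remaining(self) -> float:
        """Budget left; zero or negative means the budget is spent."""
        return self.allowed - self.flagged_interactions

    def policy(self) -> str:
        """Mirror the SRE rule: budget left means room for risk, none means slow down."""
        return "room to take risks" if self.remaining > 0 else "slow down and repair"


budget = ForgivenessBudget(target=0.95, total_interactions=1000, flagged_interactions=20)
print(round(budget.allowed), round(budget.remaining), budget.policy())
```

Note that a target of 1.0 gives an allowed budget of zero, which is the "demand perfection" case described above: the first mistake exhausts it.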

### Incident Management -> [[Conflict Resolution]]
In SRE, when a system breaks, we declare an **Incident**, assign an **Incident Commander** (IC), and follow a **Runbook**.
In a community, this maps to **[[Conflict Resolution]]** protocols.
*   **Question:** When a "flame war" breaks out, who is the IC? What is the runbook?
*   **Application:** Instead of ad-hoc piling on, we have a defined process. "This thread is overheating. [[User X]] is now the designated facilitator. We are entering 'Cool Down' mode as per Runbook A."
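
The process above can be sketched as a small state machine. Everything here is hypothetical: the steps listed under `RUNBOOK_A`, the status names, and the `ThreadIncident` class are placeholders for whatever a community actually agrees on.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"            # thread is running normally
    COOL_DOWN = "cool_down"  # incident declared, runbook in effect
    RESOLVED = "resolved"    # facilitator has closed the incident


# Hypothetical contents of "Runbook A"; a real community would write its own.
RUNBOOK_A = [
    "Announce that the thread is overheating",
    "Name the designated facilitator",
    "Pause replies for the cool-down period",
]


@dataclass
class ThreadIncident:
    thread_id: str
    facilitator: Optional[str] = None
    status: Status = Status.OPEN
    log: List[str] = field(default_factory=list)

    def declare(self, facilitator: str) -> None:
        """Declare the incident: assign the facilitator and run the runbook steps."""
        self.facilitator = facilitator
        self.status = Status.COOL_DOWN
        for step in RUNBOOK_A:
            self.log.append(f"[{self.thread_id}] {step}")

    def resolve(self) -> None:
        """Close the incident; only valid once one has been declared."""
        if self.status is not Status.COOL_DOWN:
            raise ValueError("no active incident to resolve")
        self.status = Status.RESOLVED


incident = ThreadIncident("thread-42")
incident.declare("User X")
print(incident.status.value, incident.facilitator)
```

The design choice worth noting is that the facilitator is assigned at the moment the incident is declared, so there is never an active incident without someone responsible for it.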

### Post-Mortems -> [[Restorative Justice Circles]]
In SRE, after an incident, we hold a **Blameless Post-Mortem**. The goal is not to fire the engineer who pushed the bug, but to understand *why* the system allowed the bug to be pushed.
In a community, this maps to **[[Restorative Justice Circles]]**.
*   **Question:** How do we learn from social failure without scapegoating?
*   **Application:** "We had a bad argument. Let's not just ban the agitator. Let's look at the systemic triggers. Did our UI encourage hostility? Was the context unclear?" Focus on *healing* the system and the relationships.

## See Also
- [[Reliability Engineering for Communities]]
- [[Community SRE]]
- [[Psychological Safety]]